The goal of topic modeling is to find the concepts that run through a collection of documents by analyzing the words of the texts. It is a powerful technique, comparable to clustering or classification, for discovering hidden structure in large document collections: documents are modeled in terms of latent topics that reflect their meaning. A topic is viewed as a recurring pattern of co-occurring words, that is, a group of words that often appear together. Topic modeling can link words that share a context and differentiate between uses of words with different meanings. Common methods include the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). (1)
Topic modeling can be approached in two different ways:
As a generative model, in which the only information relevant to the model is the number of times each word is produced (the bag-of-words assumption).
As a problem of statistical inference, in which the generative process is reversed to recover the hidden topics from the observed words.
Topic modeling is a technique to extract abstract topics from a collection of documents. To do so, the input document-term matrix is usually decomposed into two low-rank matrices: a document-topic matrix and a topic-word matrix.
The vector space model is a typical solution for keyword searches. It represents a document as a vector in which each entry corresponds to a different word and the number at that entry is how many times that word appears in the document. (4)
Latent Semantic Analysis (LSA) is an approach to automatic indexing and information retrieval that attempts to overcome some problems with VSM by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the (high-dimensional) vector space representation of documents based on term frequencies as a starting point and applies a dimension reducing linear projection. The specific form of this mapping is determined by a given document collection and is based on a Singular Value Decomposition (SVD) of the corresponding term/document matrix. The general claim is that similarities between documents or between documents and queries can be more reliably estimated in the reduced latent space representation than in the original representation. The rationale is that documents which share frequently co-occurring terms will have a similar representation in the latent space, even if they have no terms in common. LSA thus performs some sort of noise reduction and has the potential benefit to detect synonyms as well as words that refer to the same topic. In many applications this has proven to result in more robust word processing. (6)
LSI (Latent Semantic Indexing) is also used by search engines to determine whether content is genuinely on-topic and in-depth or just spam. The engines determine this by looking at the words in an article and deciding how relevant they are to each other.
An example is a web search for “windows.” If you search for “windows,” there are hundreds of related keywords you might think of:
“Bill Gates”
“Microsoft”
“Windows 10”
“Surface tablet”
These keywords are naturally grouped together, and rightly so, as they are potential LSI keywords when writing a post about “windows.”
LSI also helps to differentiate from the “other” windows:
“Window cleaning”
“Double glazed windows”
“Wooden windows”
“Window locks” (7)
Probabilistic Latent Semantic Indexing is an approach to automated document indexing based on a statistical latent class model for factor analysis of count data. Fitted to a training corpus of text documents by a generalization of the Expectation-Maximization algorithm, the model can deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term-matching methods as well as over LSI. In particular, combining models with different dimensionalities has proven advantageous. (9)
Latent Dirichlet Allocation (LDA) treats each document as a mixture of topics and each topic as a mixture of words. LDA estimates both mixtures at the same time, producing the mixture of topics that describes each document.
The vector space model works by comparing vectors representing entire documents and queries. If there are t total terms, then the information from each document \(D_i\) is contained in the following vector of t dimensions (1):
\[ D_i = (d_{i,1}, d_{i,2}, \dots , d_{i,t}) \] These vectors together form the document-term matrix. However, we typically work with the tf-idf weighting measure that we have already discussed in class, which reflects the importance of each word (2).
Recall that tf, term frequency, is simply the number of times a term appears in a document, and that idf, inverse document frequency, reflects how many documents the term appears in across the whole collection. The idf value is calculated as follows.
\[ \mbox{idf}(t,D) = \log\left(\frac{N}{n_{t}}\right) \] Here N is the total number of documents in the collection D and \(n_t\) is the number of documents containing term t. The tf-idf value is then calculated as follows.
\[ \mbox{tf-idf}(t,d,D) = \mbox{tf}(t,d) \cdot \mbox{idf}(t,D) \]
Words with high tf-idf values are used often in a document, but not very often throughout all the documents. We can then use methods like cosine similarity to compare document to document or query to document (2).
\[ \cos(q,d) = \frac{q \cdot d}{\|q\| \, \|d\|} \] We then rank the documents according to their similarity to the query. The cosine similarity concept can also be shown visually.
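In R, this similarity can be wrapped in a small helper function (a sketch; the function name `cosine` is our own, not from a package):

```r
# Cosine similarity between two term-weight vectors:
# dot product divided by the product of the vector norms.
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cosine(c(1, 0, 1), c(1, 1, 0))  # 0.5: the toy vectors share one term
```

The same quantity is computed inline with `%*%` in the worked examples that follow.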
Consider this example from the University of Ottawa (4). There are three documents that only have three words each.
The tf-idf values are calculated according to the formula above, here using base-2 logarithms. As an example, consider “new”: the word appears in two documents, once in the first and once in the second. This is a simple example because the term frequencies per document are all either 1 or 0, so the tf-idf values depend only on the idf. The calculations are as follows.
new <- log2(3/2)      # "new" appears in 2 of the 3 documents
york <- log2(3/2)
times <- log2(3/2)
post <- log2(3/1)     # "post" appears in only 1 document
los <- log2(3/1)
angeles <- log2(3/1)
The tf-idf matrix is then a collection of vectors representing each document.
\[ document = (new, york, times, post, los, angeles) \]
(doc1 <- c(new*1, york*1, times*1, post*0, los*0, angeles*0))
## [1] 0.5849625 0.5849625 0.5849625 0.0000000 0.0000000 0.0000000
(doc2 <- c(new*1, york*1, times*0, post*1, los*0, angeles*0))
## [1] 0.5849625 0.5849625 0.0000000 1.5849625 0.0000000 0.0000000
(doc3 <- c(new*0, york*0, times*1, post*0, los*1, angeles*1))
## [1] 0.0000000 0.0000000 0.5849625 0.0000000 1.5849625 1.5849625
How similar are the documents?
(doc1 %*% doc2)/(sqrt(sum(doc1^2))*sqrt(sum(doc2^2)))
## [,1]
## [1,] 0.3778002
(doc1 %*% doc3)/(sqrt(sum(doc3^2))*sqrt(sum(doc1^2)))
## [,1]
## [1,] 0.1457895
(doc2 %*% doc3)/(sqrt(sum(doc2^2))*sqrt(sum(doc3^2)))
## [,1]
## [1,] 0
Which document is most similar to the “new new times” query? To calculate tf-idf for a query term, divide that word’s frequency in the query by the maximum frequency of any word in the query, then multiply by the word’s idf.
(query <- c(new * (2/2), york * 0, times * (1/2), post * 0, los * 0, angeles * 0))
## [1] 0.5849625 0.0000000 0.2924813 0.0000000 0.0000000 0.0000000
(doc1 %*% query)/(sqrt(sum(doc1^2))*sqrt(sum(query^2)))
## [,1]
## [1,] 0.7745967
(doc2 %*% query)/(sqrt(sum(doc2^2))*sqrt(sum(query^2)))
## [,1]
## [1,] 0.2926428
(doc3 %*% query)/(sqrt(sum(doc3^2))*sqrt(sum(query^2)))
## [,1]
## [1,] 0.112928
As expected, document 1 would be the first result when searching this query.
In LSA, we use truncated singular value decomposition (SVD), a linear algebra technique that factorizes a matrix into three separate matrices. This method learns about the latent topics by performing matrix decomposition on the document-term matrix. That matrix is typically very sparse and noisy, so we reduce its dimensionality in order to find the relationships between words and documents. As with VSM, some people prefer to use tf-idf values in the matrix rather than raw counts. The formula for truncated SVD is as follows (5).
\[ A \approx U_tS_tV_t^T \] One way to think about this process is that we keep only the t most important dimensions, where t is a number we choose ahead of time based on how many topics we want to extract.
When A is a document-term matrix, U is the document-topic matrix and V is the term-topic matrix, with one column per topic; when A is a term-document matrix, as in the example below, the roles of U and V swap. So if t is two, we keep two columns of each. With these matrices, we can then apply cosine similarity or other measures.
Consider the following documents (6).
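The definition of the matrix `a` does not appear in the notes. Assuming the classic three-document “gold silver truck” collection from reference (6), a plausible reconstruction of the term-document count matrix (one row per term, one column per document; the row order is our assumption) is:

```r
# Term-document count matrix: 11 terms (rows) x 3 documents (columns).
# Assumed row order: a, arrived, damaged, delivery, fire, gold, in, of,
# shipment, silver, truck.
a <- matrix(c(1, 1, 1,   # a
              0, 1, 1,   # arrived
              1, 0, 0,   # damaged
              0, 1, 0,   # delivery
              1, 0, 0,   # fire
              1, 0, 1,   # gold
              1, 1, 1,   # in
              1, 1, 1,   # of
              1, 0, 1,   # shipment
              0, 2, 0,   # silver
              0, 1, 1),  # truck
            nrow = 11, byrow = TRUE)
```

This matrix reproduces the singular values printed in the decomposition below.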
mysvd <- svd(a)
(u <- mysvd$u)
## [,1] [,2] [,3]
## [1,] -0.4201216 -0.07479925 -0.04597244
## [2,] -0.2994868 0.20009226 0.40782766
## [3,] -0.1206348 -0.27489151 -0.45380010
## [4,] -0.1575610 0.30464762 -0.20064670
## [5,] -0.1206348 -0.27489151 -0.45380010
## [6,] -0.2625606 -0.37944687 0.15467426
## [7,] -0.4201216 -0.07479925 -0.04597244
## [8,] -0.4201216 -0.07479925 -0.04597244
## [9,] -0.2625606 -0.37944687 0.15467426
## [10,] -0.3151220 0.60929523 -0.40129339
## [11,] -0.2994868 0.20009226 0.40782766
(v <- mysvd$v)
## [,1] [,2] [,3]
## [1,] -0.4944666 -0.6491758 -0.5779910
## [2,] -0.6458224 0.7194469 -0.2555574
## [3,] -0.5817355 -0.2469149 0.7749947
(s <- diag(mysvd$d))
## [,1] [,2] [,3]
## [1,] 4.098872 0.000000 0.000000
## [2,] 0.000000 2.361571 0.000000
## [3,] 0.000000 0.000000 1.273669
u %*% s %*% t(v) # multiplying the factors back together recovers the original matrix, up to floating-point error
## [,1] [,2] [,3]
## [1,] 1.000000e+00 1.000000e+00 1.000000e+00
## [2,] 1.387779e-15 1.000000e+00 1.000000e+00
## [3,] 1.000000e+00 -4.996004e-16 1.720846e-15
## [4,] 6.383782e-16 1.000000e+00 -9.992007e-16
## [5,] 1.000000e+00 -4.718448e-16 1.554312e-15
## [6,] 1.000000e+00 -8.534840e-16 1.000000e+00
## [7,] 1.000000e+00 1.000000e+00 1.000000e+00
## [8,] 1.000000e+00 1.000000e+00 1.000000e+00
## [9,] 1.000000e+00 -8.534840e-16 1.000000e+00
## [10,] 1.276756e-15 2.000000e+00 -1.998401e-15
## [11,] 1.276756e-15 1.000000e+00 1.000000e+00
Since a here is a term-document matrix, U holds the coordinates of the terms and V holds the coordinates of the documents in the latent space. If we want to look at two topics, we keep the first two columns of U and V and the first two rows and columns of S.
(u <- u[,1:2])
## [,1] [,2]
## [1,] -0.4201216 -0.07479925
## [2,] -0.2994868 0.20009226
## [3,] -0.1206348 -0.27489151
## [4,] -0.1575610 0.30464762
## [5,] -0.1206348 -0.27489151
## [6,] -0.2625606 -0.37944687
## [7,] -0.4201216 -0.07479925
## [8,] -0.4201216 -0.07479925
## [9,] -0.2625606 -0.37944687
## [10,] -0.3151220 0.60929523
## [11,] -0.2994868 0.20009226
(v <- v[,1:2])
## [,1] [,2]
## [1,] -0.4944666 -0.6491758
## [2,] -0.6458224 0.7194469
## [3,] -0.5817355 -0.2469149
(s <- s[1:2,1:2])
## [,1] [,2]
## [1,] 4.098872 0.000000
## [2,] 0.000000 2.361571
Let’s look for the query “gold silver truck” using the following formula (6).
\[ \hat{q} = q^T U_t S_t^{-1} \]
(q <- matrix(c(0,0,0,0,0,1,0,0,0,1,1), 11, 1, byrow=TRUE)) # a 1 in each row corresponding to a query term: gold, silver, truck
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
## [4,] 0
## [5,] 0
## [6,] 1
## [7,] 0
## [8,] 0
## [9,] 0
## [10,] 1
## [11,] 1
(q <- as.vector(t(q) %*% u %*% solve(s)))
## [1] -0.2140026 0.1820571
We now have the coordinates of the query and the coordinates of each of the documents.
(doc1 <- v[1,])
## [1] -0.4944666 -0.6491758
(doc2 <- v[2,])
## [1] -0.6458224 0.7194469
(doc3 <- v[3,])
## [1] -0.5817355 -0.2469149
We now use the same process as before to find the most relevant document.
(doc1 %*% q)/(sqrt(sum(doc1^2))*sqrt(sum(q^2)))
## [,1]
## [1,] -0.05395084
(doc2 %*% q)/(sqrt(sum(doc2^2))*sqrt(sum(q^2)))
## [,1]
## [1,] 0.9909874
(doc3 %*% q)/(sqrt(sum(doc3^2))*sqrt(sum(q^2)))
## [,1]
## [1,] 0.4479595
Document two would be the top result for this query. This can also be shown visually.
The PLSA model was meant to improve upon LSA by adding probabilistic concepts. The model revolves around two main assumptions: topic z is present in document d with probability \(P(z|d)\), and word w is present in topic z with probability \(P(w|z)\). The joint probability of seeing document d and word w together is shown below (5).
\[ P(d,w) = P(d)\sum_z{P(z|d)P(w|z)} = \sum_z{P(z)P(d|z)P(w|z)} \] The terms on the right side are the parameters of the PLSA model. While some are learned through direct observation of the corpus, others are treated as multinomial distributions and are estimated using a process called expectation-maximization (EM). The first formulation is called the asymmetric formulation and the second the symmetric formulation; the symmetric form is perfectly symmetric in the two entities, documents and words (7). In the asymmetric formulation we start with a document and generate the topic and word with some probability; in the symmetric formulation we start with a topic and then generate the document and the word. The symmetric formulation has a direct connection to LSA.
This connection makes clear that the only difference between LSA and PLSA, as expected, is the inclusion of probabilistic concepts. The \(P(d|z)\) term relates to U, the document-topic matrix, and the \(P(w|z)\) term relates to V, the topic-term matrix.
There are few openly available examples using PLSA; people interested in topic modeling tend to gravitate toward LSA or LDA.
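For illustration, the EM fit of the symmetric PLSA formulation can be sketched in base R (a toy implementation; the function name `plsa` and the small count matrix are our own, not from a package):

```r
# EM for symmetric PLSA: P(d,w) = sum_z P(z) P(d|z) P(w|z).
# n is a word-by-document count matrix, K the number of topics.
plsa <- function(n, K, iters = 100) {
  W <- nrow(n); D <- ncol(n)
  set.seed(1)
  pz  <- rep(1 / K, K)                      # P(z)
  pwz <- matrix(runif(W * K), W, K)         # P(w|z), one column per topic
  pwz <- sweep(pwz, 2, colSums(pwz), "/")
  pdz <- matrix(runif(D * K), D, K)         # P(d|z), one column per topic
  pdz <- sweep(pdz, 2, colSums(pdz), "/")
  for (it in 1:iters) {
    nz <- rep(0, K); nwz <- matrix(0, W, K); ndz <- matrix(0, D, K)
    for (d in 1:D) {
      for (w in 1:W) {
        if (n[w, d] == 0) next
        post <- pz * pwz[w, ] * pdz[d, ]    # E-step: posterior P(z|d,w)
        post <- post / sum(post)
        inc  <- n[w, d] * post              # expected counts for this cell
        nz   <- nz + inc
        nwz[w, ] <- nwz[w, ] + inc
        ndz[d, ] <- ndz[d, ] + inc
      }
    }
    pz  <- nz / sum(nz)                     # M-step: renormalize the counts
    pwz <- sweep(nwz, 2, colSums(nwz), "/")
    pdz <- sweep(ndz, 2, colSums(ndz), "/")
  }
  list(pz = pz, pwz = pwz, pdz = pdz)
}

# Tiny 3-word, 2-document example
fit <- plsa(matrix(c(2, 0, 1, 0, 3, 1), nrow = 3), K = 2, iters = 50)
```

Each EM iteration computes the posterior topic responsibilities (E-step) and renormalizes the resulting expected counts into new multinomial parameters (M-step), which is exactly the fitting procedure described above.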
LDA is the newest and most popular method for topic modeling. As each successive method built on the previous one, their usefulness and complexity both increased. We will try to explain the mathematical concepts behind LDA as best we can.
The main difference between PLSA and LDA is the incorporation of Bayesian concepts. Again, we consider that each document is made up of a number of topics and each topic is made up of a number of words. Integral to LDA is the use of Dirichlet priors for the document-topic and topic-word distributions. The Dirichlet distribution is a generalization of the beta distribution and is often called a “distribution of distributions” (5). The general process is outlined in the figure below.
Like the other methods, we choose the number of topics k we want to find. This time, however, there are two additional parameters, \(\alpha\) and \(\beta\): \(\alpha\) is the prior weight of each topic in a document, and \(\beta\) is the prior weight of each word in a topic. We usually set these to low values such as 0.1 or 0.01 because we expect there to be few dominant topics per document and few dominant words per topic (8).
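To make the roles of \(\alpha\) and \(\beta\) concrete, here is a minimal collapsed Gibbs sampler for LDA in base R (a sketch with a hypothetical toy corpus; this is our own illustration, not the implementation used by the lda or topicmodels packages):

```r
# Collapsed Gibbs sampling for LDA. docs is a list of integer word-id
# vectors, V the vocabulary size, K the number of topics.
lda_gibbs <- function(docs, K, V, alpha = 0.1, beta = 0.01, iters = 200) {
  set.seed(1)
  ndk <- matrix(0, length(docs), K)  # document-topic counts
  nkw <- matrix(0, K, V)             # topic-word counts
  nk  <- rep(0, K)                   # total words assigned to each topic
  # random initial topic assignment for every token
  z <- lapply(docs, function(d) sample(K, length(d), replace = TRUE))
  for (d in seq_along(docs)) {
    for (i in seq_along(docs[[d]])) {
      k <- z[[d]][i]; w <- docs[[d]][i]
      ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
    }
  }
  for (it in 1:iters) {
    for (d in seq_along(docs)) {
      for (i in seq_along(docs[[d]])) {
        k <- z[[d]][i]; w <- docs[[d]][i]
        # remove the token's current assignment from the counts
        ndk[d, k] <- ndk[d, k] - 1; nkw[k, w] <- nkw[k, w] - 1; nk[k] <- nk[k] - 1
        # full conditional: alpha smooths topics per document,
        # beta smooths words per topic
        p <- (ndk[d, ] + alpha) * (nkw[, w] + beta) / (nk + V * beta)
        k <- sample(K, 1, prob = p)
        z[[d]][i] <- k
        ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
      }
    }
  }
  list(doc_topic = ndk, topic_word = nkw)
}

# Toy corpus over a 6-word vocabulary: documents 1-2 use words 1:3,
# documents 3-4 use words 4:6, so two clean topics should emerge.
docs <- list(c(1, 2, 3, 1), c(2, 3, 1), c(4, 5, 6, 4), c(5, 6, 4))
fit <- lda_gibbs(docs, K = 2, V = 6)
```

Normalizing the rows of `doc_topic` (after adding \(\alpha\)) gives each document’s topic mixture; small \(\alpha\) and \(\beta\) push those mixtures toward sparsity, as described above.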
LDA is great at providing output that is easily understandable. For example, if we search an ESPN database we might find that Topic 1 is best represented by “NFL, Super, Bowl, football, coach, quarterback” and Topic 2 is represented by “NBA, LeBron, Steph, Warriors, coach.” If a new ESPN article is published, we could find the topic mixture for that article based on the information learned from our corpus. The ability to quickly interpret new articles is the main advantage over PLSA (5).
In practice, people working with LDA rely on software to produce quick results. Multiple packages allow you to run LDA in R; two of the main ones are lda and topicmodels. We will go through some examples during the next class.